Session 2: Data Structures and Wrangling

ESS: Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2025-07-08

Introduction

This Course

tinytable_7p1oigx582p6qsuc0cmk
Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, you learn:

  • how data plays into the research process
  • the difference between content and structure of data
  • about the basic data structures in R and what they are good for
  • how to turn information into data
  • the key role of tables
  • and how to turn bad data structures into good tables

Zane Lee via unsplash.com

Data Data Data

The research process

Can be broadly separated into three steps:

  1. Data Collection
  2. Data Processing
  3. Data Analysis

Source: Weidmann (2023)

The research process

Source: Wickham, Çetinkaya-Rundel, and Grolemund (2023)

The research process

Can be broadly separated into three steps (Weidmann 2023):

  1. Data Collection – next week
  2. Data Processing – this week
  3. Data Analysis – only examples

Why is Data Processing important?

  • Data to answer most interesting questions is almost never available in the format you need it–otherwise someone would have done it already
  • Being able to combine and wrangle data into the shape you need makes you wanted inside and outside academia
  • Good Data Processing strategies will let you sleep easy at night:
    • manageable: Avoiding the REAL drudge work of hand-cleaning data by building pipelines to automate
    • scalable: Extending your data collection to new cases and scopes is easy
    • transparent: It will be easy to communicate (or remember) later
    • robust: It won’t fail if you draw out a JENGA stone

Your turn (Exercises 1)

  • What was your worst experience with data?

What is Data?

Definition

  • systematic and structured way of representing information (about the real world)
  • consist of content + structure
    • content: WHAT is stored, the information itself (e.g., numbers, text, images, sound or videos)
    • structure: HOW information is stored
    • structure makes information (easily) readable by a computer

Source: Weidmann (2023)

Core Data Structures in R

  • Logical: Represents Boolean values, either TRUE or FALSE.

  • Integer: Specifically represents integer values. Example: 42L (The L indicates that the number is an integer)

  • Numeric or Double: Represents decimal numbers or real numbers. Example: 42, 3.14

  • Character: Represents text or string values. Example: "hello"

  • Factor: Represents categorical data, data is stored as codes and levels. Example: factor(c("male", "female"))

  • Date: Represents dates. Example: as.Date("2025-07-08")

  • POSIXct / POSIXlt: Represents date-time objects. Examples: as.POSIXct("2025-07-08 12:34:56"), as.POSIXlt("2025-07-08 12:34:56")

  • Raw: Represents raw bytes, often used for binary data. Example: charToRaw("A")

  • Matrix: A 2-dimensional array where all elements must be of the same type (numeric, character, etc.). Example:

    my_matrix <- matrix(
      data = 1:9,
      nrow = 3,
      ncol = 3
    )

Nested Data Structures in R

  • Data Frame: A table or 2-dimensional array-like structure where each column can contain different types of data (numeric, character, factor, etc.). Example:

    data <- data.frame(
      name = c("Alice", "Bob", "Charlie"),
      age = c(25, 30, 35),
      gender = factor(c("F", "M", "M"))
    )
  • List: A versatile data structure that can contain elements of different types, including vectors, data frames, and even other lists. Example:

    my_list <- list(
      name = "Alice",
      age = 25,
      scores = c(85, 90, 88),
      details = list(address = "123 Main St", city = "Anytown")
    )

Your turn (Exercises 2)

  • Did I forget any data structures?
  • head to https://www.menti.com/ and vote with code “6800 3260”

From information to structured data

Simple Example

ex1 <- "United Kingdom is a country with 67 million inhabitants, its capital is London."
library(stringr)
country     <- str_extract(ex1, "^.+?(?= is)")
inhabitants <- str_extract(ex1, "\\d+?(?= million)")
capital     <- str_extract(ex1, "(?<=capital is ).+?(?=\\.)")
data.frame(
  country,    
  inhabitants,
  capital
)
         country inhabitants capital
1 United Kingdom          67  London

But what are these strange symbols?

Regular Expression: your friend when working with strings

  • Regular expressions are “a very terse language that allow you to describe patterns in strings” (Wickham, Çetinkaya-Rundel, and Grolemund 2023)

  • String: sequences of characters, for instance, letters and punctuation (stored as character in R). Example: “Headline”

  • Regular expressions: string patterns with non-literal meaning that tells computer what to look for:

    strings <- c("headline", "Headline")
    str_view(strings, "[Hh]")
    [1] │ <h>eadline
    [2] │ <H>eadline
    str_view(strings, "[Hh]eadline")
    [1] │ <headline>
    [2] │ <Headline>

    Explanation: [Hh] characters inside square brackets are matched

Regular Expression: your “friend” when working with strings

Regular expression syntax (adapted from Atteveldt, Trilling, and Arcíla (2021))
Specifier: What to match Syntax Example Matches
All characters except for new lines . d.g dig, d!g
Word characters (letters, digits,_) \\w d\\wg dig, dog, d2g
Digits (0 to 9) \\d 202\\d 2020, 2021
Whitespace (space, tab, newline) \\s
Newline \n
Beginning of the string ^ ^go gogo go
Ending of the string $ go$ go gogo
Beginning or end of word \\b \\bword\\b a word! And two words
Either first or second option …|… cat|dog A cat is chased by a dog
Quantifier: How many to match
Zero or more * d.*g dg, drag, d = g
Zero or more (non-greedy) *? d.*?g dogg
One or more + \\d+% 1%, an increase from 25% to 30%
One or more (non-greedy) +? \\d+% an increase from 25% to 30%
Zero or one ? colou?r color, colour
Exactly n times {n} \\d{4} 1940, 2020
At least n times {n,} \\d{3,} 194, 2020 but not 19
Between n and m times {n,m} \\d{3,4} 194, 2020 but not 19, 20200
Other constructs
Groups (…) '(bla ){2}' bla bla bla
Back references \\1 str_replace("this", "(th)is", "\\1ose") produces “those”
Selection of characters […] d[ieo]g dig, dog, dug
Range of characters in selection [A-z] [Hhu] Hi, how are you
Everything except selection [^...] [^Hhl] Hi, how are you
Escape special character \\ 3\\.14 3.14 but not 3,14
Lookahead (?=) thi(?=s) this
Lookbehind (?<=) (?<=t)his this
Negative lookahead (?!) thi(?!s) thin, thistle
Negative lookbehind (?<!) (?<!t)his whisper, thistle

Regular Expression: important functions from stringr

  • str_replace_all(): replaces with different string
  • str_remove_all(): removes completely
  • str_extract_all(): extracts pattern matching regular expression
  • str_detect(): returns a logical vector that is TRUE when regular expression matches
  • str_subset(): returns whole element when regular expression matches
  • str_sub(): returns, e.g., 4th until the 8th character
  • fixed(): turns of search for regular expressions

Learn more:

Longer Example

data_input <- "
Albania is a country with 2.8 million inhabitants, its capital is Tirana, and it was founded on 28 November 1912.
Andorra is a country with 77,000 inhabitants, its capital is Andorra la Vella, and it was founded on 8 September 1278.
Austria is a country with 8.9 million inhabitants, its capital is Vienna, and it was founded on 12 November 1918.
Belarus is a country with 9.5 million inhabitants, its capital is Minsk, and it was founded on 25 August 1991.
Belgium is a country with 11.5 million inhabitants, its capital is Brussels, and it was founded on 4 October 1830.
Bosnia and Herzegovina is a country with 3.3 million inhabitants, its capital is Sarajevo, and it was founded on 1 March 1992.
Bulgaria is a country with 6.9 million inhabitants, its capital is Sofia, and it was founded on 22 September 1908.
Croatia is a country with 4 million inhabitants, its capital is Zagreb, and it was founded on 25 June 1991.
Cyprus is a country with 1.2 million inhabitants, its capital is Nicosia, and it was founded on 16 August 1960.
Czech Republic is a country with 10.7 million inhabitants, its capital is Prague, and it was founded on 1 January 1993.
Denmark is a country with 5.8 million inhabitants, its capital is Copenhagen, and it was founded on 5 June 1849.
Estonia is a country with 1.3 million inhabitants, its capital is Tallinn, and it was founded on 20 August 1991.
Finland is a country with 5.5 million inhabitants, its capital is Helsinki, and it was founded on 6 December 1917.
France is a country with 67 million inhabitants, its capital is Paris, and it was founded on 22 September 1792.
Germany is a country with 83 million inhabitants, its capital is Berlin, and it was founded on 3 October 1990.
Greece is a country with 10.4 million inhabitants, its capital is Athens, and it was founded on 25 March 1821.
Hungary is a country with 9.6 million inhabitants, its capital is Budapest, and it was founded on 23 October 1989.
Iceland is a country with 366,000 inhabitants, its capital is Reykjavik, and it was founded on 17 June 1944.
Ireland is a country with 5 million inhabitants, its capital is Dublin, and it was founded on 6 December 1922.
Italy is a country with 60 million inhabitants, its capital is Rome, and it was founded on 17 March 1861.
Kosovo is a country with 1.8 million inhabitants, its capital is Pristina, and it was founded on 17 February 2008.
Latvia is a country with 1.9 million inhabitants, its capital is Riga, and it was founded on 18 November 1918.
Liechtenstein is a country with 39,000 inhabitants, its capital is Vaduz, and it was founded on 23 January 1719.
Lithuania is a country with 2.8 million inhabitants, its capital is Vilnius, and it was founded on 11 March 1990.
Luxembourg is a country with 634,000 inhabitants, its capital is Luxembourg City, and it was founded on 9 June 1815.
Malta is a country with 514,000 inhabitants, its capital is Valletta, and it was founded on 21 September 1964.
Moldova is a country with 2.6 million inhabitants, its capital is Chișinău, and it was founded on 27 August 1991.
Monaco is a country with 39,000 inhabitants, its capital is Monaco, and it was founded on 8 January 1297.
Montenegro is a country with 622,000 inhabitants, its capital is Podgorica, and it was founded on 3 June 2006.
Netherlands is a country with 17.4 million inhabitants, its capital is Amsterdam, and it was founded on 26 July 1581.
North Macedonia is a country with 2.1 million inhabitants, its capital is Skopje, and it was founded on 8 September 1991.
Norway is a country with 5.4 million inhabitants, its capital is Oslo, and it was founded on 7 June 1905.
Poland is a country with 38 million inhabitants, its capital is Warsaw, and it was founded on 11 November 1918.
Portugal is a country with 10.3 million inhabitants, its capital is Lisbon, and it was founded on 5 October 1143.
Romania is a country with 19 million inhabitants, its capital is Bucharest, and it was founded on 1 December 1918.
Russia is a country with 144 million inhabitants, its capital is Moscow, and it was founded on 12 June 1990.
San Marino is a country with 34,000 inhabitants, its capital is San Marino, and it was founded on 3 September 301.
Serbia is a country with 6.7 million inhabitants, its capital is Belgrade, and it was founded on 5 June 2006.
Slovakia is a country with 5.4 million inhabitants, its capital is Bratislava, and it was founded on 1 January 1993.
Slovenia is a country with 2.1 million inhabitants, its capital is Ljubljana, and it was founded on 25 June 1991.
Spain is a country with 47 million inhabitants, its capital is Madrid, and it was founded on 6 December 1978.
Sweden is a country with 10.4 million inhabitants, its capital is Stockholm, and it was founded on 6 June 1523.
Switzerland is a country with 8.3 million inhabitants, its capital is Bern, and it was founded on 12 September 1848.
Ukraine is a country with 41 million inhabitants, its capital is Kyiv, and it was founded on 24 August 1991.
United Kingdom is a country with 67 million inhabitants, its capital is London, and it was founded on 1 January 1801.
Vatican City is a country with 825 inhabitants, its capital is Vatican City, and it was founded on 11 February 1929.
"
library(tidyverse)
tibble(input = data_input) |> 
  separate_longer_delim(input, "\n") |> 
  filter(input != "") |> 
  mutate(
    country     = str_extract(input, "^.+?(?= is)"),
    inhabitants = str_extract(input, "\\d+?(?= million)"),
    capital     = str_extract(input, "(?<=capital is ).+?(?=\\,)"),
    founded     = str_extract(input, "\\d{1,2} [A-z]+ \\d{3,4}")
  ) |> 
  glimpse()
Rows: 46
Columns: 5
$ input       <chr> "Albania is a country with 2.8 million inhabitants, its ca…
$ country     <chr> "Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bo…
$ inhabitants <chr> "8", NA, "9", "5", "5", "3", "9", "4", "2", "7", "8", "3",…
$ capital     <chr> "Tirana", "Andorra la Vella", "Vienna", "Minsk", "Brussels…
$ founded     <chr> "28 November 1912", "8 September 1278", "12 November 1918"…

What is wrong with this?

Longer Example (fixed)

country_df <- tibble(input = data_input) |> 
  separate_longer_delim(input, "\n") |> 
  filter(input != "") |> 
  mutate(
    country     = str_extract(input, "^.+?(?= is)"),
    inhabitants = str_extract(input, "\\d+? million|\\d+\\.\\d+? million|[0-9,]+?(?= inhabitants)"),
    capital     = str_extract(input, "(?<=capital is ).+?(?=\\,)"),
    founded     = str_extract(input, "\\d{1,2} [A-z]+ \\d{3,4}")
  ) |> 
  mutate(
    inhabitants = str_replace(inhabitants, "\\.(\\d) million", "\\100000"),
    inhabitants = str_replace(inhabitants, " million", "000000"),
    inhabitants = str_remove(inhabitants, ","),
    inhabitants = as.integer(inhabitants),
    founded     = lubridate::dmy(founded)
  ) |> 
  select(-input)
glimpse(country_df)
Rows: 46
Columns: 4
$ country     <chr> "Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bo…
$ inhabitants <int> 2800000, 77000, 8900000, 9500000, 11500000, 3300000, 69000…
$ capital     <chr> "Tirana", "Andorra la Vella", "Vienna", "Minsk", "Brussels…
$ founded     <date> 1912-11-28, 1278-09-08, 1918-11-12, 1991-08-25, 1830-10-0…

Exercises 3

  1. replace all whitespace characters in the country columns of country_df with underscores (_)
  2. remove all punctuation from the data_input vector
  3. Extract the date from the string and turn it into a proper date vector:
string <- "Military defeats following the outbreak of the French Revolutionary Wars resulted in the insurrection of 10 August 1792. The monarchy was abolished and replaced by the French First Republic one month later."
  1. Check which of the lines in data_input have the word “million” (hint: you need to split the string into a vector with str_split() first)
  2. Now save a subset of the lines into a new object
  3. From the following vector, I want you to write code that identifies the URLs from the German Wikipedia that do not use the secure Hypertext Transfer Protocol (HTTPS):
vec <- c(
  "https://en.wikipedia.org/wiki/France",
  "https://en.wikipedia.org/wiki/Germany",
  "https://de.wikipedia.org/wiki/Deutschland",
  "http://de.wikipedia.org/wiki/Frankreich",
  "https://de.wikipedia.org/wiki/Belgien",
  "https://de.wikipedia.org/wiki/Bulgarien",
  "http://de.wikipedia.org/wiki/K%C3%B6nigreich_D%C3%A4nemark",
  "https://de.wikipedia.org/wiki/Slowenien",
  "https://de.wikipedia.org/wiki/Rum%C3%A4nien"
)

Tables, Tables, Tables

Country data in a data.frame

  • different types of information, clearly labelled as such

  • each line refers to an observation, each column contains a different variable

  • the variables are stored in different data types:

    typeof(country_df$country)
    [1] "character"
    typeof(country_df$inhabitants)
    [1] "integer"
    typeof(country_df$founded)
    [1] "double"
  • most statistical tools are designed to process data in tables (especially in the tidyverse)

country_df
# A tibble: 46 × 4
   country                inhabitants capital          founded   
   <chr>                        <int> <chr>            <date>    
 1 Albania                    2800000 Tirana           1912-11-28
 2 Andorra                      77000 Andorra la Vella 1278-09-08
 3 Austria                    8900000 Vienna           1918-11-12
 4 Belarus                    9500000 Minsk            1991-08-25
 5 Belgium                   11500000 Brussels         1830-10-04
 6 Bosnia and Herzegovina     3300000 Sarajevo         1992-03-01
 7 Bulgaria                   6900000 Sofia            1908-09-22
 8 Croatia                    4000000 Zagreb           1991-06-25
 9 Cyprus                     1200000 Nicosia          1960-08-16
10 Czech Republic            10700000 Prague           1993-01-01
# ℹ 36 more rows

Country data to plots

country_df |> 
  count(founded = year(founded)) |> 
  arrange(founded) |> 
  mutate(countries = cumsum(n)) |> 
  ggplot(aes(x = founded, y = countries)) +
  geom_line() +
  labs(x = NULL, y = NULL, title = "Number of countries in Europe over time*",
       caption = "*Dissolved countries were ignored")

country_df |> 
  mutate(country = fct_reorder(country, inhabitants)) |> 
  ggplot(aes(x = inhabitants, y = country)) +
  geom_col() +
  labs(x = NULL, y = NULL, title = "Largest countries in Europe*",
       caption = "*By inhabitants")

Accessing Data

Say you want the inhabitants for row 45 (United Kingdom)

country_df |> 
  slice(45) |> 
  pull(inhabitants)
[1] 67000000

Or better:

country_df |> 
  filter(country == "United Kingdom") |> 
  pull(inhabitants)
[1] 67000000

Or maybe you want the entire row:

country_df |> 
  filter(country == "United Kingdom") 
# A tibble: 1 × 4
  country        inhabitants capital founded   
  <chr>                <int> <chr>   <date>    
1 United Kingdom    67000000 London  1801-01-01

Or all rows of countries older than 1300:

country_df |> 
  filter(founded < "1300-01-01") 
# A tibble: 4 × 4
  country    inhabitants capital          founded   
  <chr>            <int> <chr>            <date>    
1 Andorra          77000 Andorra la Vella 1278-09-08
2 Monaco           39000 Monaco           1297-01-08
3 Portugal       3000000 Lisbon           1143-10-05
4 San Marino       34000 San Marino       301-09-03 

Updating Data

Let’s say you don’t like that the founding of France is set to the founding of the French Republic (22 September 1792) and you want to use the date of the Treaty of Verdun (10 August 843) instead:

France <- tibble(country = "France", 
                 inhabitants = 67000000L, 
                 capital = "Paris", 
                 founded = as.Date("843-08-10"))
country_df |> 
  rows_update(France, by = "country") |> 
  filter(founded < "1300-01-01") 
# A tibble: 5 × 4
  country    inhabitants capital          founded   
  <chr>            <int> <chr>            <date>    
1 Andorra          77000 Andorra la Vella 1278-09-08
2 France        67000000 Paris            843-08-10 
3 Monaco           39000 Monaco           1297-01-08
4 Portugal       3000000 Lisbon           1143-10-05
5 San Marino       34000 San Marino       301-09-03 

Adding Data

What if a new country was founded in Europe today?

essland <- tibble(country = "Essex Summer School", 
                  inhabitants = 38 * 15, 
                  capital = "Colchester Campus", 
                  founded = Sys.Date())
country_df |> 
  add_case(essland) |> 
  arrange(desc(founded)) |> 
  head()
# A tibble: 6 × 4
  country             inhabitants capital           founded   
  <chr>                     <dbl> <chr>             <date>    
1 Essex Summer School         570 Colchester Campus 2024-07-17
2 Kosovo                  8000000 Pristina          2008-02-17
3 Serbia                  7000000 Belgrade          2006-06-05
4 Montenegro               622000 Podgorica         2006-06-03
5 Czech Republic          7000000 Prague            1993-01-01
6 Slovakia                4000000 Bratislava        1993-01-01

Deleting Data

Now say you are only interested in countries with more than one million inhabitants. You can delete superfluos data with:

country_df_clean <- country_df |> 
  filter(inhabitants >= 1000000)

Exercises 4

  1. From the airports table in the nycflights13 package, access all airports that are in the Los Angeles time zone:
airports <- nycflights13::airports
  1. Access the faa of the airport at latitude 40.87522 and longitude -74.28136
  2. The Las Vegas International Aiport (LAS) has been renamed since the dataset was created. Update the name to “Harry Reid International”
  3. Delete all aiports that have NA as their timezone

Bad and good table structures

Grow tables down, not sideways

  • Wide tables are often seen because it makes data entry convenient for humans
  • But they cause many issues when processing the data computationally
bad_table <- tibble(
  country = c("Switzerland", "Austria"),
  pop1950 = c(4.7, 6.9),
  pop1960 = c(5.3, 7.1),
  pop1970 = c(6.2, 7.5)
)
bad_table
# A tibble: 2 × 4
  country     pop1950 pop1960 pop1970
  <chr>         <dbl>   <dbl>   <dbl>
1 Switzerland     4.7     5.3     6.2
2 Austria         6.9     7.1     7.5

Grow tables down, not sideways: sideways are difficult

Mean population across variables:

bad_table |> 
  summarise(
    pop_mean_1950 = mean(pop1950),
    pop_mean_1960 = mean(pop1960),
    pop_mean_1970 = mean(pop1970)
  )
# A tibble: 1 × 3
  pop_mean_1950 pop_mean_1960 pop_mean_1970
          <dbl>         <dbl>         <dbl>
1           5.8           6.2          6.85

Mean population overall:

bad_table |> 
  summarise(
    pop_mean = mean(c(pop1950, pop1960, pop1970))
  )
# A tibble: 1 × 1
  pop_mean
     <dbl>
1     6.28

Plot population vs country:

ggplot(bad_table) +
  geom_point(aes(x = pop1950, y = country), colour = "black") +
  geom_point(aes(x = pop1960, y = country), colour = "grey") +
  geom_point(aes(x = pop1970, y = country), colour = "green")

  • now imagine there are 50 or 100 years in that observation…

Grow tables down, not sideways

  • Every row now keeps an observation, each column represents a different variable
  • Rather than in the column header, year values are now their own variable
  • This form of a table is called a long table
  • Easy to add or remove observations
  • Most statistical tools are built for this table
  • Operations in R (and especially in the tidyverse) are now much easier
good_table <- tibble(
  country = c(rep("Switzerland", 3), rep("Austria", 3)),
  year = c(rep(c(1950, 1960, 1970), 2)),
  population = c(4.7, 5.3, 6.2, 6.9, 7.1, 7.5)
)
good_table
# A tibble: 6 × 3
  country      year population
  <chr>       <dbl>      <dbl>
1 Switzerland  1950        4.7
2 Switzerland  1960        5.3
3 Switzerland  1970        6.2
4 Austria      1950        6.9
5 Austria      1960        7.1
6 Austria      1970        7.5

Source: Weidmann (2023)

Grow tables down, not sideways: it makes computation easier

Mean population across variables:

good_table |> 
  group_by(year) |> 
  summarise(
    pop_mean = mean(population)
  )
# A tibble: 3 × 2
   year pop_mean
  <dbl>    <dbl>
1  1950     5.8 
2  1960     6.2 
3  1970     6.85

Mean population overall:

good_table |> 
  summarise(
    pop_mean = mean(population)
  )
  pop_mean
1 6.283333

Plot population vs country:

ggplot(good_table) +
  geom_point(aes(x = population, y = country, colour = as.factor(year)))

Avoid reduncancy

  • Some variables are connected to groups and never (or almost never) change
  • The constant repetition introduces redundant information in our dataset
  • Depending on the size of our data, this can lead to increased processing time, a larger memory footpint, and more demand for storage
  • Conceptually redundancy is introduced when we store information about different entities (that refer to each other) in a single table (here countries and population)
bad_table2 <- tibble(
  country = c(rep("Switzerland", 3), rep("Austria", 3)),
  year = c(rep(c(1950, 1960, 1970), 2)),
  population = c(4.7, 5.3, 6.2, 6.9, 7.1, 7.5),
  capital = c("Bern", "Bern", "Bern", "Vienna", "Vienna", "Vienna")
)
bad_table2
# A tibble: 6 × 4
  country      year population capital
  <chr>       <dbl>      <dbl> <chr>  
1 Switzerland  1950        4.7 Bern   
2 Switzerland  1960        5.3 Bern   
3 Switzerland  1970        6.2 Bern   
4 Austria      1950        6.9 Vienna 
5 Austria      1960        7.1 Vienna 
6 Austria      1970        7.5 Vienna 

Source: Weidmann (2023)

Avoid reduncancy: Store entities in different tables

population-per-year information

good_table_population <- bad_table2 |> 
  select(-capital)
good_table_population
# A tibble: 6 × 3
  country      year population
  <chr>       <dbl>      <dbl>
1 Switzerland  1950        4.7
2 Switzerland  1960        5.3
3 Switzerland  1970        6.2
4 Austria      1950        6.9
5 Austria      1960        7.1
6 Austria      1970        7.5

country-level information

good_table_countires <- bad_table2 |> 
  select(country, capital) |> 
  distinct()
good_table_countires
# A tibble: 2 × 2
  country     capital
  <chr>       <chr>  
1 Switzerland Bern   
2 Austria     Vienna 

Avoid reduncancy: When not to

  • During data processing and management
  • During data analysis (at least not strictly)
  • During analysis, you often want to compare, contrast, connect entities, so you need data about all of them
  • Create “analysis datasets” that contain all the data that you currently need
analysis_datasets <- merge(good_table_population, good_table_countires, by = "country")
analysis_datasets
      country year population capital
1     Austria 1950        6.9  Vienna
2     Austria 1960        7.1  Vienna
3     Austria 1970        7.5  Vienna
4 Switzerland 1950        4.7    Bern
5 Switzerland 1960        5.3    Bern
6 Switzerland 1970        6.2    Bern

Luke Chesser via unsplash.com

Tidying data

Tidy Data

“Happy families are all alike; every unhappy family is unhappy in its own way.”
— Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
— Hadley Wickham

Tidyverse got its name from the idea of tidy data, which is defined as (Wickham 2014):

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each value is a cell; each cell is a single value.
  4. (Each type of observational unit forms a table.)

Any other form of data is by definition messy data. This is essentially the same definition like we’ve seen above: tidy data = good table.

Indentify the tidy Data

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Reshaping

From wide to long:

tidy_table <- bad_table |> 
  pivot_longer(
    cols = starts_with("pop"), # define which columns this applies to
    names_to = "year",         # name the new variable containing the column names
    values_to = "population"   # name the new variable containing the cell values
  ) |> 
  mutate(year = as.integer(str_extract(year, "\\d+")))

As said above, this is exactly the same principle we looked at for a good table:

all.equal(tidy_table, good_table)
[1] TRUE

How does pivoting work?

df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)
df |> 
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

Source: Wickham, Çetinkaya-Rundel, and Grolemund (2023)

More complex reshaping

household
# A tibble: 5 × 5
  family dob_child1 dob_child2 name_child1 name_child2
   <int> <date>     <date>     <chr>       <chr>      
1      1 1998-11-26 2000-01-29 Susan       Jose       
2      2 1996-06-22 NA         Mark        <NA>       
3      3 2002-07-11 2004-04-05 Sam         Seth       
4      4 2004-10-10 2009-08-27 Craig       Khai       
5      5 2000-12-05 2005-02-28 Parker      Gracie     
household |> 
  pivot_longer(
    cols = !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = TRUE
  )
# A tibble: 9 × 4
  family child  dob        name  
   <int> <chr>  <date>     <chr> 
1      1 child1 1998-11-26 Susan 
2      1 child2 2000-01-29 Jose  
3      2 child1 1996-06-22 Mark  
4      3 child1 2002-07-11 Sam   
5      3 child2 2004-04-05 Seth  
6      4 child1 2004-10-10 Craig 
7      4 child2 2009-08-27 Khai  
8      5 child1 2000-12-05 Parker
9      5 child2 2005-02-28 Gracie

pivot_wider

Not as common, but sometimes one or two columns contain several variables:

cms_patient_experience
# A tibble: 500 × 5
   org_pac_id org_nm                           measure_cd measure_title prf_rate
   <chr>      <chr>                            <chr>      <chr>            <dbl>
 1 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       63
 2 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       87
 3 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       86
 4 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       57
 5 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       85
 6 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       24
 7 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       59
 8 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       85
 9 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       83
10 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       63
# ℹ 490 more rows

In this case, we need to do the opposite operations: widening data

good_table |> 
  pivot_wider(
    id_cols = country, 
    names_from = year, 
    values_from = population, 
    names_prefix = "pop"
  )
# A tibble: 2 × 4
  country     pop1950 pop1960 pop1970
  <chr>         <dbl>   <dbl>   <dbl>
1 Switzerland     4.7     5.3     6.2
2 Austria         6.9     7.1     7.5

We have reconstructed bad table 😓

-> What is a variable and what is an observation depends on the questions you ask

pivot_wider: fix patient data

cms_patient_experience |> 
  pivot_wider(
    id_cols = starts_with("org"), # we use both columns as id just to not widen them
    names_from = measure_cd,
    values_from = prf_rate
  )
# A tibble: 95 × 8
   org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
   <chr>      <chr>        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 0446157747 USC C…          63          87          86          57          85
 2 0446162697 ASSOC…          59          85          83          63          88
 3 0547164295 BEAVE…          49          NA          75          44          73
 4 0749333730 CAPE …          67          84          85          65          82
 5 0840104360 ALLIA…          66          87          87          64          87
 6 0840109864 REX H…          73          87          84          67          91
 7 0840513552 SCL H…          58          83          76          58          78
 8 0941545784 GRITM…          46          86          81          54          NA
 9 1052612785 COMMU…          65          84          80          58          87
10 1254237779 OUR L…          61          NA          NA          65          NA
# ℹ 85 more rows
# ℹ 1 more variable: CAHPS_GRP_12 <dbl>

Separating columns

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
table3 |> 
  separate(col = rate, into = c("cases", "population"), sep = "/")
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Exercises 5

  1. Look at the table billboard that is included in tidyr. What is not tidy about it?
billboard
# A tibble: 317 × 79
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
  1. Tidy the data
  2. The new table contains many missing values (NAs) explain why and if you should delete them
  3. Independently from whether it is the right thing to do or now: delete the NAs
  4. Why does this code fail?
table4a |>  
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
Error in `pivot_longer()`:
! Can't select columns past the end.
ℹ Locations 1999 and 2000 don't exist.
ℹ There are only 3 columns.
  1. Make this into a tidy table
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

Unnesting & Plucking

The problem

A huge problem in web scraping: nested lists:

  • tables are not always the best option for data storage or transfer
  • many APIs work with JSON files, which can be endlessly nested structures
  • bringing these into a tidy form is almost never straightforward
  • imagine you’re searching an object hidden within the smallest matryoshka in an matryoshka store

Syria Polidoro via unsplash.com

Unnesting

For simple operations, unnesting, unlisting and binding can help:

sports_results <- list(
  name = "Alice",
  age = 25,
  scores = list(list(game1 = 85, game2 = 90, game3 = 88)),
  details = list(list(address = "123 Main St", city = "Anytown"))
)
lobstr::tree(sports_results)
<list>
├─name: "Alice"
├─age: 25
├─scores: <list>
│ └─<list>
│   ├─game1: 85
│   ├─game2: 90
│   └─game3: 88
└─details: <list>
  └─<list>
    ├─address: "123 Main St"
    └─city: "Anytown"

Unnesting:

sports_results_df <- as_tibble(sports_results) |> 
  unnest_longer(scores) |> 
  unnest_longer(details)
sports_results_df
# A tibble: 6 × 6
  name    age scores scores_id details     details_id
  <chr> <dbl>  <dbl> <chr>     <chr>       <chr>     
1 Alice    25     85 game1     123 Main St address   
2 Alice    25     85 game1     Anytown     city      
3 Alice    25     90 game2     123 Main St address   
4 Alice    25     90 game2     Anytown     city      
5 Alice    25     88 game3     123 Main St address   
6 Alice    25     88 game3     Anytown     city      

We can now use our skills to tidy this up:

players <- sports_results_df |> 
  pivot_wider(id_cols = c(name, age, scores_id, scores), names_from = details_id, values_from = details) |> 
  select(-scores_id, -scores)
players
# A tibble: 3 × 4
  name    age address     city   
  <chr> <dbl> <chr>       <chr>  
1 Alice    25 123 Main St Anytown
2 Alice    25 123 Main St Anytown
3 Alice    25 123 Main St Anytown
games <- sports_results_df |> 
  mutate(game = str_extract(scores_id, "\\d+")) |> 
  select(name, scores, game)
games
# A tibble: 6 × 3
  name  scores game 
  <chr>  <dbl> <chr>
1 Alice     85 1    
2 Alice     85 1    
3 Alice     90 2    
4 Alice     90 2    
5 Alice     88 3    
6 Alice     88 3    

Unlisting:

tibble(
  name = sports_results$name,
  age = sports_results$age,
  game = names(unlist(sports_results$scores)),
  scores = unlist(sports_results$scores)
)
# A tibble: 3 × 4
  name    age game  scores
  <chr> <dbl> <chr>  <dbl>
1 Alice    25 game1     85
2 Alice    25 game2     90
3 Alice    25 game3     88

Binding is useful when the list already has the format of a data frame:

list(
  list(name = "Alice", age = 25),
  list(name = "John", age = 24)
) |> 
  bind_rows()
# A tibble: 2 × 2
  name    age
  <chr> <dbl>
1 Alice    25
2 John     24

Or if the list is really a list of data frames:

list(
  tibble(name = "Alice", age = 25),
  tibble(name = "John", age = 24)
) |> 
  bind_rows()
# A tibble: 2 × 2
  name    age
  <chr> <dbl>
1 Alice    25
2 John     24

Plucking

When nothing else helps, we need to pluck the values we want from the list:

sports_results$details[[1]]$address
[1] "123 Main St"

The tidyverse has a function for this:

sports_results |> 
  pluck("details", 1, "address")
[1] "123 Main St"

It’s main advantage is that you can define what happens if the object is not found:

sports_results |> 
  pluck("details", 1, "telephone", .default = NA_character_)
[1] NA

Twitter data case study

I pulled this messy data off the ESS Twitter page:

ess_tweets <- readRDS("data/nested_ess.rds")

Let’s try and have a look:

ess_tweets |> 
  lobstr::tree(max_length = 25)
<list>
└─data: <list>
  └─user: <list>
    └─result: <list>
      ├─__typename: "User"
      └─timeline_v2: <list>
        └─timeline: <list>
          ├─instructions: <list>
          │ ├─<list>
          │ │ └─type: "TimelineClearCache"
          │ ├─<list>
          │ │ ├─type: "TimelinePinEntry"
          │ │ └─entry: <list>
          │ │   ├─entryId: "tweet-1765314896617427390"
          │ │   ├─sortIndex: "1813597028811276288"
          │ │   └─content: <list>
          │ │     ├─entryType: "TimelineTimelineItem"
          │ │     ├─__typename: "TimelineTimelineItem"
          │ │     ├─itemContent: <list>...
          │ │     └─clientEventInfo: <list>...
          │ └─<list>
          │   ├─type: "TimelineAddEntries"
          │   └─entries: <list>
          │     ├─<list>
          │     │ ├─entryId: "profile-conversation-18135970288..."
... 
View(ess_tweets)

Twitter data case study: searching the store

Since we already know that the data contains the latest Tweets, we can use a function I wrote to search for it in the data:

parse_path <- function(ix) {
  out <- as.list(ix$p)
  out[which(ix$p == as.character(ix$pos))] <- ix$pos[ix$p == as.character(ix$pos)]
  gsub("list(", "purrr::pluck(DATA, ", deparse1(out), fixed = TRUE)
}

#' Search a list
#'
#' @param l a list
#' @param f a function to identify the element you are searching
#'
#' @return an object containing the searched element with the function to extract it as a name
#' @export
list_search <- function(l, f) {
  
  paths <- rrapply::rrapply(
    object = l,
    condition = f,
    f = function(x, .xparents, .xname, .xpos) list(p = .xparents, n = .xname, pos = .xpos),
    how = "flatten"
  )
  
  out <- purrr::map(paths, function(p) purrr::pluck(l, !!!p$pos))
  names(out) <- purrr::map_chr(paths, parse_path)
  return(out)
}
list_search(ess_tweets, function(x) str_detect(x, "Only one week to go until all our session three courses close!"))
$`purrr::pluck(DATA, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 2L, "content", "itemContent", "tweet_results", "result", "legacy", "full_text")`
[1] "RT @EssexSumSchool: Only one week to go until all our session three courses close! \n\nMost of these courses take place from 5 – 16 August an…"

$`purrr::pluck(DATA, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 2L, "content", "itemContent", "tweet_results", "result", "legacy", "retweeted_status_result", "result", "legacy", "full_text")`
[1] "Only one week to go until all our session three courses close! \n\nMost of these courses take place from 5 – 16 August and there are 11 courses available including machine learning for different types of data and agent-based models.\n\nFind out more: https://t.co/ThWnkYp0ye\n#ESS2024 https://t.co/fsve1tEdhd"

$`purrr::pluck(DATA, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 3L, "content", "itemContent", "tweet_results", "result", "legacy", "full_text")`
[1] "Only one week to go until all our session three courses close! \n\nMost of these courses take place from 5 – 16 August and there are 11 courses available including machine learning for different types of data and agent-based models.\n\nFind out more: https://t.co/ThWnkYp0ye\n#ESS2024 https://t.co/fsve1tEdhd"

Twitter data case study: getting the right doll

Going one level up from this tweet might contain more useful information:

pluck(ess_tweets, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries", 3L, "content", "itemContent", "tweet_results", "result", "legacy") |> 
  lobstr::tree(max_length = 25)
<list>
├─bookmark_count: 0
├─bookmarked: FALSE
├─created_at: "Mon Jul 15 11:04:53 +0000 2024"
├─conversation_id_str: "1812805529590218988"
├─display_text_range: <list>
│ ├─0
│ └─279
├─entities: <list>
│ ├─hashtags: <list>
│ │ └─<list>
│ │   ├─indices: <list>
│ │   │ ├─271
│ │   │ └─279
│ │   └─text: "ESS2024"
│ ├─media: <list>
│ │ └─<list>
│ │   ├─display_url: "pic.x.com/fsve1tedhd"
│ │   ├─expanded_url: "https://twitter.com/EssexSumScho..."
│ │   ├─id_str: "1812796825721593856"
│ │   ├─indices: <list>
│ │   │ ├─280
│ │   │ └─303
│ │   ├─media_key: "3_1812796825721593856"
│ │   ├─media_url_https: "https://pbs.twimg.com/media/GShZ..."
... 

Looks like this is the entire tweet. Additionally, the part of the path called “entries” might suggest we can just extract all tweets relatively easily:

entries <- pluck(ess_tweets, "data", "user", "result", "timeline_v2", "timeline", "instructions", 3L, "entries")
lobstr::tree(entries, max_depth = 3, max_length = 25)
<list>
├─<list>
│ ├─entryId: "profile-conversation-18135970288..."
│ ├─sortIndex: "1813597028811276287"
│ └─content: <list>
│   ├─entryType: "TimelineTimelineModule"
│   ├─__typename: "TimelineTimelineModule"
│   ├─items: <list>...
│   ├─metadata: <list>
│   │ └─conversationMetadata: <list>...
│   ├─displayType: "VerticalConversation"
│   └─clientEventInfo: <list>...
├─<list>
│ ├─entryId: "tweet-1813535484532125895"
│ ├─sortIndex: "1813597028811276286"
│ └─content: <list>
│   ├─entryType: "TimelineTimelineItem"
│   ├─__typename: "TimelineTimelineItem"
│   ├─itemContent: <list>...
│   └─clientEventInfo: <list>...
├─<list>
│ ├─entryId: "tweet-1812805529590218988"
│ ├─sortIndex: "1813597028811276285"
│ └─content: <list>
│   ├─entryType: "TimelineTimelineItem"
... 

Twitter data case study: cleaning up the mess

It seems all of these entries are structured in the same way, which is a good sign. So hopefully we can just extract the content from the same position in all of them:

tweets <- map(entries, function(x) pluck(x, "content", "itemContent", "tweet_results", "result", "legacy"))
# alternatively, this produces the same outcome
# tweets <- map(entries, c("content", "itemContent", "tweet_results", "result", "legacy"))
lobstr::tree(tweets, max_depth = 2, max_length = 25)
<list>
├─<NULL>
├─<list>
│ ├─bookmark_count: 0
│ ├─bookmarked: FALSE
│ ├─created_at: "Wed Jul 17 11:25:28 +0000 2024"
│ ├─conversation_id_str: "1813535484532125895"
│ ├─display_text_range: <list>...
│ ├─entities: <list>...
│ ├─favorite_count: 0
│ ├─favorited: FALSE
│ ├─full_text: "RT @EssexSumSchool: Only one wee..."
│ ├─is_quote_status: FALSE
│ ├─lang: "en"
│ ├─quote_count: 0
│ ├─reply_count: 0
│ ├─retweet_count: 2
│ ├─retweeted: FALSE
│ ├─user_id_str: "1525016244"
│ ├─id_str: "1813535484532125895"
│ └─retweeted_status_result: <list>
│   └─result: <list>...
├─<list>
│ ├─bookmark_count: 0
│ ├─bookmarked: FALSE
... 

Yes, these contain tweets! Although not all entries are actually populated with data, it seems. Nevertheless, we can turn this into a tidy table:

tweets_df <- map(tweets, function(t) {
  tibble(
    id = t$id_str,
    user_id = t$user_id_str,
    created_at = t$created_at,
    full_text = t$full_text,
    favorite_count = t$favorite_count,
    retweet_count = t$retweet_count,
    bookmark_count = t$bookmark_count
  )
}) |> 
  bind_rows()

glimpse(tweets_df)
Rows: 16
Columns: 7
$ id             <chr> "1813535484532125895", "1812805529590218988", "18127957…
$ user_id        <chr> "1525016244", "1525016244", "1525016244", "1525016244",…
$ created_at     <chr> "Wed Jul 17 11:25:28 +0000 2024", "Mon Jul 15 11:04:53 …
$ full_text      <chr> "RT @EssexSumSchool: Only one week to go until all our …
$ favorite_count <int> 0, 0, 0, 0, 0, 6, 5, 0, 2, 6, 0, 0, 2, 8, 0, 2
$ retweet_count  <int> 2, 2, 7, 2, 1, 1, 2, 1, 1, 5, 11, 7, 1, 4, 4, 1
$ bookmark_count <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0

Exercises 6

  1. The object below is a nested list with Game of Thrones characters. Tidy the data into a data.frame that contains at least name, gender, culture and birthday of each character
got_characters <- repurrrsive::got_chars
  1. From the got_characters object, also extract in which season(s) they appear, so that the data frame contains one row per season appearance (and still the same character information as above)

  2. Sometimes, it makes sense to keep nested information around for later. The got_characters object contains aliases for the characters. Store them along the character name in a nested format, so that each row looks like this:

tibble(
  name = "Theon Greyjoy",
  aliases = list(c("Prince of Fools", "Theon Turncloak", "Reek", "Theon Kinslaye"))
)
# A tibble: 1 × 2
  name          aliases  
  <chr>         <list>   
1 Theon Greyjoy <chr [4]>
  1. The object below contains someone’s music collection. Create a tidy table of the artist this person listens to:
music_collection <- repurrrsive::discog
  1. Now add the title of the songs to the artists table

Reminder: Social Programme

DATE Event Time Venue
MONDAY 7 July Meet and Greet - in person 19:00 start SU Bar
TUESDAY 8 July Climbing 18:30 start Sports Centre
WEDNESDAY 9 July Harold Clarke Speaker Series - hybrid 18:45 - 20.00 EBS
THURSDAY 10 July Sports Night 18:30 - 20:30 Sports Centre
FRIDAY 11 July Wivenhoe Pub Run 18:30 start Wivenhoe pubs
MONDAY 14 JULY SU bar Quiz 19:00 start SU Bar
TUESDAY 15 JULY Sports Night 18:30 - 20:30 Sports Centre
WEDNESDAY 16 JULY Harold Clarke Speaker Series - hybrid 18:30 EBS
THURSDAY 17 JULY Farewell Party Karaoke 20:30 - 23:30 SU Bar

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4 forcats_1.0.0   dplyr_1.1.4     purrr_1.0.4    
 [5] readr_2.1.5     tidyr_1.3.1     tibble_3.3.0    ggplot2_3.5.1  
 [9] tidyverse_2.0.0 stringr_1.5.1   tinytable_0.8.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      jsonlite_2.0.0    compiler_4.5.1    tidyselect_1.2.1 
 [5] scales_1.3.0      yaml_2.3.10       fastmap_1.2.0     R6_2.6.1         
 [9] generics_0.1.3    knitr_1.50        munsell_0.5.1     pillar_1.10.2    
[13] tzdb_0.5.0        rlang_1.1.6       stringi_1.8.7     litedown_0.7     
[17] xfun_0.52         timechange_0.3.0  cli_3.6.5         withr_3.0.2      
[21] magrittr_2.0.3    digest_0.6.37     grid_4.5.1        rstudioapi_0.17.1
[25] hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.3   
[29] glue_1.8.0        codetools_0.2-20  fansi_1.0.6       colorspace_2.1-1 
[33] rmarkdown_2.29    tools_4.5.1       pkgconfig_2.0.3   htmltools_0.5.8.1
.table-striped {
  > tbody > tr:nth-of-type(odd) > * {
    background-color: #fff9ce;
  }
}
.table-hover {
  > tbody > tr:hover > * {
    background-color: #ffe99e; /* Adjust this color as needed */
  }
}

References

Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. 2021. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons. https://cssbook.net.
Hase, Valerie. 2021. Text as Data Methods in R - Applications for Automated Analyses of News Content. https://bookdown.org/valerie_hase/TextasData_HS2021/.
Weidmann, Nils B. 2023. Data Management for Social Scientists: From Files to Databases. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108990424.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.